Predict¶

The Predict functionality builds a model of the Target against Investment (or Impression) for every dimension in the dataset. The user can choose whether seasonality is considered in the model-building process. This tab displays statistics and charts that show how well the model fits each dimension.

Input for Predict Functionality¶

  • Seasonal Effects: The user must choose whether seasonality effects are included in the model:
    • Non-Seasonality Selection: The user provides, for each dimension, the date range on which the model is trained.
      • Note:
        • The date range for each dimension can be entered by clicking the plus icon on the left panel.
        • If no date range is provided, the default date range (the dimension's start and end dates in its historic data) is used for that dimension.
    • Seasonality Selection: The entire date range is used so that the effects of seasonality can be captured.
      • Note: The model considers seasonality only when data is available for two complete cycles; this applies to both day-level (weekly) and month-level (annual) seasonality. With daily data granularity, if less than 2 years of data is available, only day-level seasonality is considered by the model.

  • Median/Mean Selection: The user selects either median or mean from the drop-down. This selection is used to calculate the bounds/constraints for the optimization process on the Optimize and Goal Seek tabs.
    • Note:
      • Bounds computation: Minimum = 0; Maximum = 3 x the Median/Mean spend of the dimension.
      • The selected option is displayed by default as points on the response curves chart.
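
The bounds rule above can be illustrated with a minimal Python sketch. The function name `spend_bounds` and the sample spend values are hypothetical and are not taken from the product's code; only the rule itself (minimum 0, maximum 3 x median or mean spend) comes from this page.

```python
import statistics


def spend_bounds(spend_history, statistic="median", multiplier=3.0):
    """Return (lower, upper) budget bounds for one dimension.

    Hypothetical helper: lower bound is 0, upper bound is
    `multiplier` times the median or mean of the historic spend.
    """
    if statistic == "median":
        centre = statistics.median(spend_history)
    elif statistic == "mean":
        centre = statistics.fmean(spend_history)
    else:
        raise ValueError("statistic must be 'median' or 'mean'")
    return 0.0, multiplier * centre


# Illustrative spend values for a single dimension (made up for this example).
daily_spend = [100.0, 120.0, 80.0, 400.0]
print(spend_bounds(daily_spend, "median"))  # (0.0, 330.0)
print(spend_bounds(daily_spend, "mean"))    # (0.0, 525.0)
```

In this example the single large spend value pulls the mean-based upper bound well above the median-based one, which is the practical difference between the two options.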

Note on Predict Functionality¶

  • The Predict functionality works only after the Explore functionality has run successfully.
  • In some scenarios, a dimension is discarded by the model during the prediction process, or the user can discard a dimension they do not want to include in the Optimize and Goal Seek functionalities.

Results from Predict Functionality¶

Response Curves for Dimensions¶

  • The Predict page displays a trend chart with the following views, based on user selection:
    • Investment vs. Predictions or Impression vs. Predictions
    • Investment vs. CPA/ROI or Impression vs. Conversion Rate
      • Note:
        • The y-axis can be switched between Predictions and CPA/ROI/Conversion Rate using the Y-axis Selector on the chart.
        • The dimension selection remains intact when the user switches the y-axis.
        • CPA or ROI is calculated based on the 'Type of Target' selection on the Explore tab.
        • In the case of impressions, the y-axis is always 'Conversion Rate', regardless of the 'Type of Target' selection. It is calculated as Predictions / (Impressions x 1000).
  • The user can select the Median Points checkbox, the Mean Points checkbox, or both to view the respective historic data points on the chart.
    • Note:
      • By default, the option selected in the Median/Mean selector on the left-hand panel is displayed on the chart.
      • If the user does not want to view either set of data points, both checkboxes can be unselected.
  • The seasonality effect is included in the chart based on the option selected by the user.
  • To analyze the trend for a particular set of dimensions, the user can use the multi-select drop-down provided in the chart itself.
  • The trend chart displays only the data points that fall within the selected date range for each dimension.
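
As a rough illustration of the y-axis quantities, the sketch below computes the conversion rate with the formula stated above. The CPA and ROI helpers use conventional definitions (investment per predicted conversion, and predicted revenue per unit of investment); those two definitions are assumptions, since this page does not state the exact formulas the product uses. All function names are hypothetical.

```python
def conversion_rate(predictions, impressions):
    """Conversion rate as stated on this page: Predictions / (Impressions x 1000)."""
    if impressions <= 0:
        return None  # undefined when there are no impressions
    return predictions / (impressions * 1000)


def cpa(investment, predicted_conversions):
    """ASSUMED definition: cost per acquisition = investment / predicted conversions."""
    if predicted_conversions <= 0:
        return None
    return investment / predicted_conversions


def roi(investment, predicted_revenue):
    """ASSUMED definition: return on investment = predicted revenue / investment."""
    if investment <= 0:
        return None
    return predicted_revenue / investment


print(conversion_rate(50, 2))  # 0.025
print(cpa(1000, 50))           # 20.0
print(roi(1000, 2500))         # 2.5
```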

Chart comparing Actual vs. Prediction on Target variable¶

  • The Predict page also compares the trend between Actual and Prediction on the Target variable for each dimension.
  • Weekly and annual seasonality charts are included (and can be viewed using the horizontal scroll option) based on the option selected by the user.
  • To analyze the trend for a particular dimension, the user can use the drop-down provided in the chart itself.
  • The trend chart displays only the data points that fall within the selected date range for each dimension.
  • The chart shows how well the trained model predicts the target variable compared with the actual values. The model is trained on the selected date range for that dimension.
  • The page also lets the user download charts, statistics and response curve equations for all the dimensions.

Statistics based on Actual and Prediction on Target variable¶

  • The Predict page shows statistics based on the actual and predicted values of the target variable selected by the user in the Explore page input panel.
  • These statistics help the user understand the accuracy of the predictions on the target variable.
  • The following statistics are included on the Predict page:
    • SMAPE: Symmetric mean absolute percentage error (SMAPE or sMAPE) is an accuracy measure based on percentage (or relative) errors. For each data point, the absolute difference between the actual and predicted values is divided by the sum of the absolute actual value and the absolute predicted value; these ratios are then averaged. The lower the SMAPE value, the higher the accuracy of the model.
    • Correlation: Correlation measures the strength of the relationship between the actual and predicted values, i.e. the extent to which the two increase or decrease in parallel. The higher the value, the stronger the relationship. It is expressed as a percentage.
    • Number of Data Points: The total number of data points present for the selected dimension.
    • Data Points post outlier treatment: The total number of data points remaining after outlier treatment for the selected dimension. The Z-score method is used for outlier treatment.
    • % of Data Points discarded during outlier treatment: The percentage drop in data points due to outlier treatment for the selected dimension.
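
The statistics above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the product's code: the function names are hypothetical, the SMAPE follows the description on this page (other SMAPE variants divide the denominator by 2), and the Z-score cut-off of 3 is an assumed value because the page does not state the threshold used.

```python
import numpy as np


def smape(actual, predicted):
    """SMAPE in percent: mean of |A - P| / (|A| + |P|), as described above."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    diff = np.abs(actual - predicted)
    denom = np.abs(actual) + np.abs(predicted)
    # Points where both values are zero contribute zero error.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100.0 * ratio.mean()


def correlation_pct(actual, predicted):
    """Pearson correlation between actual and predicted, as a percentage."""
    return 100.0 * np.corrcoef(actual, predicted)[0, 1]


def zscore_keep_mask(values, threshold=3.0):
    """Boolean mask of points kept by Z-score filtering (threshold is ASSUMED)."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return np.ones(values.shape, dtype=bool)  # no spread, nothing to drop
    return np.abs((values - values.mean()) / std) <= threshold


def pct_dropped(n_before, n_after):
    """Percentage of data points discarded during outlier treatment."""
    return 100.0 * (n_before - n_after) / n_before


# Illustrative data: 19 ordinary points and one extreme value.
target = [10.0] * 19 + [1000.0]
keep = zscore_keep_mask(target)
print(int(keep.sum()), pct_dropped(len(target), int(keep.sum())))  # 19 5.0
```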

Prediction functionality based on User Input¶

  • The Predict page lets the user make predictions on the Target variable for a selected dimension.
  • The user must enter the investment and either the number of days (for non-seasonality selection) or a date range (for seasonality selection) to get the predicted value.
  • After the user enters the inputs for the selected dimension and clicks the Predict button, the predicted value is displayed.

Discarded Dimensions¶

  • In some scenarios, one or more dimensions are discarded, either by the prediction model or by the user.
  • The user can discard a dimension they do not want to include in the Optimize and Goal Seek functionalities. To discard a dimension, select it (from the chart comparing Actual vs. Prediction on the Target variable) and click the Discard button.
  • The model can discard one or more dimensions during the Predict process. The model discards a dimension under the following conditions:
    • Zero value in data points: All the data points in the Investment (or Impression) and Target variables for a dimension are zero.
    • Outlier Treatment: After outlier treatment, too few data points remain for the dimension, so the response curve cannot be built.
    • No Variation: There is no variation in Investment (or Impression) for a dimension, i.e. multiple target (prediction) values exist for a constant investment/impression.
    • Predictions: The sum of the predictions on the target variable over the complete training data is zero for the dimension.
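
The four model-side conditions can be read as a sequence of checks, sketched below. The function name, the returned labels and the `min_points` threshold are all hypothetical; in particular, this page does not say how many points count as "too few" after outlier treatment, so the value 5 is only a placeholder.

```python
import numpy as np


def discard_reason(investment, target, predictions=None, min_points=5):
    """Return a label for why a dimension would be discarded, or None to keep it.

    `investment` and `target` are the data points left after outlier treatment.
    `min_points` is an ASSUMED placeholder threshold.
    """
    x = np.asarray(investment, dtype=float)
    y = np.asarray(target, dtype=float)

    # 1. All investment (or impression) and target values are zero.
    if not x.any() and not y.any():
        return "zero_values"
    # 2. Too few points left to build a response curve.
    if x.size < min_points:
        return "too_few_points"
    # 3. Constant investment/impression: no variation to fit against.
    if x.max() - x.min() == 0:
        return "no_variation"
    # 4. Predictions on the training data sum to zero.
    if predictions is not None and np.sum(predictions) == 0:
        return "zero_predictions"
    return None


print(discard_reason([5, 5, 5, 5, 5, 5], [1, 2, 3, 4, 5, 6]))  # no_variation
```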

Note on discarding the dimension(s) to get optimum result in Optimize Functionality¶

  • For some dimensions, flat response curves are displayed in the visualization. A flat curve results from a poor fit to the data.
  • Flat curves tend to produce suboptimal outputs when used in the Optimize functionality.
  • The recommendation is either to discard these dimensions during visual inspection or to enter the minimum possible budget for them in the Optimize functionality user input.

Backend Logic¶

  • Data Filter: Filters the data to the date range selected by the user for each dimension in the input panel on the Predict page. It discards all data points where both the investment (or impression) and target variables are zero.
    • Assumption: The selected dates are in %m-%d-%y format.
  • Outlier Treatment: It removes outliers from the data for each of the dimensions. It uses the Z-score method to perform outlier treatment.
    • Refer link to know more about Z-score: Link
  • Drop Points: Calculates the number of data points pre and post outlier treatment and the percentage of data points discarded during the process.
  • Seasonality Check: Checks whether weekly and annual seasonality are applicable for model building.
    • Assumption:
      • For weekly or annual seasonality to be included in model building, at least two complete cycles of that seasonality must be present in the data.
      • This check applies only to model building with seasonality effects.
  • Fit Curve: Calculates the model parameters, the predictions on the target variable and the error metrics for each dimension. The target variable is the dependent variable being predicted, and the investment/impression variable (based on the user selection on the Explore page) is the independent variable used to build the model. If the user selects the seasonality effects option, additional parameters for weekly and annual seasonality are also included.
    • Model Equation Used: Equation to capture the nonlinear response to media variable on the dependent variable using S Curve (Hill).
      • Equation for Non-Seasonal Model: c * X^a / ( X^a + b^a )
      • Equation for Seasonal Model: ( c * X^a / ( X^a + b^a ) ) + week_coeff + month_coeff
      • Model parameters and their bounds:
        • X: Level of Exposure (Investment/Impression)
        • a: Shape parameter
          • Bound used: 0.5 to 3
        • b: Inflection parameter
          • Bound used: 30th to 100th percentile of Investment/Impression
        • c: Max/Saturation parameter
          • Bound used: zero to positive inf
        • week_coeff: Weekly seasonality parameter, for each day of the week (only applicable for the model with seasonality)
          • Bound used: zero to positive inf
        • month_coeff: Annual seasonality parameter, for each month of the year (only applicable for the model with seasonality)
          • Bound used: zero to positive inf
      • Assumption:
        • If no optimal solution for the model parameters is found, the dimension is dropped and added to the list of discarded dimensions.
        • For a model with seasonality, if weekly seasonality, annual seasonality, or both do not exist or are not supported by the data, the respective parameters are excluded from the model equation. Only weekly seasonality is considered if less than 2 years of data is available.
        • For a model with seasonality, if no weekly seasonality is available in the data for a dimension, the dimension is discarded and added to the list of discarded dimensions.
      • Model Equation Understanding:
        • The model equation used captures the nonlinear response to the media variable on the dependent variable using S-Curve (hill function).
        • It is based on the theory of diminishing returns: each additional unit of investment increases the response, but at a declining rate. Picture spend on the x-axis and the response (conversions/target) on the y-axis. As spend rises the response changes, and the slope of the curve at any point gives the marginal response.
        • At some point saturation is reached, where the same increase in spend no longer yields the same proportional increase in conversions/target. These curves help in understanding how to allocate budgets optimally across all media channels.
    • Model Building Package: Package used for finding S-Curve optimal parameters for each dimension using non-linear least squares.
      • Package: scipy.optimize.curve_fit
      • Input Parameters: scipy.optimize.curve_fit(func, xdata, ydata, bounds, method)
        • func: The model equation/function used (S-Curve (Hill) transformation)
        • xdata: The independent variable where the data is measured (investment/impression variable)
        • ydata: The dependent data (target variable)
        • bounds: Lower and upper bounds on parameters (S-Curve (Hill) function model parameters and their bounds)
        • method: 'trf' (Trust Region Reflective)
      • Refer link to know more about Model Building Package (scipy.optimize.curve_fit): Link
    • Output obtained from Fit Curve:
      • Parameters for each dimension, which helps in making predictions based on the user inputs and in the Optimize and Goal Seek functionalities (if the seasonality option is selected, it also includes weekly and annual seasonality parameters)
      • Metrics/scores (SMAPE, Correlation) for each dimension, which help in understanding the accuracy of the model for that dimension. Separate functions calculate these metrics from the actual and predicted values of the target variable.
      • Predictions of the target variable on the training data, which are used to analyze the response curves on the Predict page for each dimension
      • List of discarded dimension(s), i.e. the dimensions discarded by the model (the conditions under which a dimension is discarded are listed in the 'Results from Predict Functionality: Discarded Dimensions' section)
      • Median and Mean historic data points for all the dimensions to be displayed on the Investment (or Impression) vs Predictions chart
  • Result obtained from Predict Functionality: After the results are generated by the Fit Curve logic, a few data cleaning steps are performed before the results are displayed to the user and the required outputs are passed to the Optimize and Goal Seek functionalities. The Predict functionality generates the following outputs:
    • Parameters of each dimension
    • Metric/Score (SMAPE, Correlation, Drop Points) for each dimension
    • Prediction of target variable on training data
    • List of Discarded Dimensions
      • Assumption: If the sum of prediction on the target variable on complete training data resulted in zero for a dimension, then such dimension(s) is discarded.
    • Summary statistics on Investment/Impression
    • Median and Mean datapoints for all the dimensions
    • CPM of each dimension, if the impression option is selected by the user on Explore page
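
The Fit Curve step for the non-seasonal model can be sketched with scipy.optimize.curve_fit, using the parameter bounds listed above. This is a minimal illustration under stated assumptions, not the product's implementation: the helper names `hill` and `fit_hill`, the starting guess `p0` and the synthetic data are all hypothetical. The seasonal model would add the week_coeff and month_coeff terms, which are omitted here.

```python
import numpy as np
from scipy.optimize import curve_fit


def hill(x, a, b, c):
    """Non-seasonal S-curve (Hill): c * x^a / (x^a + b^a)."""
    return c * x**a / (x**a + b**a)


def fit_hill(x, y):
    """Fit the Hill curve for one dimension with the bounds listed on this page."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # b is bounded by the 30th and 100th percentile of investment/impression.
    b_low, b_high = np.percentile(x, [30, 100])
    lower = [0.5, b_low, 0.0]      # a >= 0.5, b >= 30th percentile, c >= 0
    upper = [3.0, b_high, np.inf]  # a <= 3,   b <= 100th percentile, c unbounded

    # ASSUMED starting guess inside the bounds; the page does not specify one.
    p0 = [1.0, (b_low + b_high) / 2.0, max(float(y.max()), 1e-6)]

    params, _ = curve_fit(hill, x, y, p0=p0, bounds=(lower, upper), method="trf")
    return params  # array of [a, b, c]


# Synthetic, noise-free data generated from known parameters (illustration only).
spend = np.linspace(10.0, 1000.0, 60)
response = hill(spend, 1.5, 400.0, 500.0)
a, b, c = fit_hill(spend, response)
print(a, b, c)
```

A dimension with constant investment would make the lower and upper bound on b equal, which curve_fit rejects; this matches the "No Variation" discard condition described earlier. In a fuller version, a failed fit would be caught and the dimension added to the list of discarded dimensions.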